Abstract


There is growing interest in self-optimizing comput- 
ing systems that can optimize their own behavior on 
different platforms without manual intervention. Ex- 
amples of successful self-optimizing systems are AT- 
LAS, which generates Basic Linear Algebra Subroutine 
(BLAS) Libraries, and FFTW, which generates FFT li- 
braries. 

Self-optimizing systems need values for hardware 
parameters such as the number of registers of various 
types and the capacities of caches at various levels. For 
example, ATLAS uses the capacity of the L1 cache and
the number of registers in determining the size of cache 
tiles and register tiles. 

In this paper, we describe X-Ray, a system for
implementing micro-benchmarks to measure such hardware
parameters. We also present novel algorithms
for measuring some of these parameters. Experimen- 
tal evaluations of X-Ray on traditional workstations, 
servers and embedded systems show that X-Ray pro- 
duces more accurate and complete results than existing 
tools. 


1. Introduction 


There is growing interest in self-optimizing systems 
that can optimize their own behavior on different plat- 
forms without manual intervention [9, 3, 6]. These 
systems are based on the generate-and-test paradigm: 
instead of writing a program, one implements a program
generator that produces a large number of program
variants, and determines empirically which variant
performs best. To prevent a combinatorial explosion
in the number of program variants that have to be
considered, self-optimizing systems bound the search space
by using hardware parameter values such as the number
of registers and the capacity of the L1 cache [9, 10].

For software to be truly self-optimizing, the values 
of hardware parameters relevant for software optimiza- 
tion must be determined automatically. It is important 
to note that these values are not necessarily the same as 
the values one might find in a hardware manual. For ex- 
ample, loop unrolling must take into account the num- 
ber of registers available to hold values computed in the 
loop body. However, most compilers set aside certain 
registers for holding special values such as the stack or 
frame pointer, so the number of registers available to 
the register allocator is usually less than the total num- 
ber of architected registers. In practice, it is hard to find 
documentation even for hardware parameter values, let 
alone for values relevant to software optimization. 

In this paper, we describe X-Ray, a framework for 
implementing micro-benchmarks to measure relevant 
values of hardware parameters automatically. For porta- 
bility, X-Ray is entirely implemented in ANSI C’89. 
One of the interesting challenges of this approach is to 
ensure that the C compiler does not perform any high- 
level restructuring optimizations on our benchmarks 
that might pollute the timing results, while performance 
critical optimizations, such as register allocation, are 
still enabled. 


2. The X-Ray Framework 


Hardware parameters are measured by X-Ray micro- 
benchmarks. Figure 1 presents the general structure of 
a micro-benchmark in the X-Ray framework. 

As an example, consider the measurement of the 
number of available registers of a particular data type T. 
One way to determine this value is to perform a number 
of experiments, all of which perform the same computations
but on a different number of variables (N) of type
T. When N exceeds the number of available registers
for type T, not all variables can be register allocated,
and execution time should increase substantially. The
number of available registers can be inferred from this
cross-over point.


Figure 1. A micro-benchmark in X-Ray

Some general conclusions can be drawn from this 
example. A micro-benchmark to determine the value of 
some parameter may need to time a number of differ- 
ent but related programs that we call nano-benchmarks. 
Since there may be no a priori bound on the number of 
required nano-benchmarks, we need a Nano-benchmark 
Generator, which can produce Nano-benchmark C 
Code from a high-level Nano-benchmark Specifica- 
tion. Finally, generation should happen on-the-fly since 
the results of one nano-benchmark may determine the 
nano-benchmark to be executed next. 

In X-Ray, the execution of a micro-benchmark is 
orchestrated by its Control Engine, which chooses the 
nano-benchmarks to execute, the order in which they 
should be executed, and the appropriate parameters for 
each one. The Control Engine determines the value of 
the hardware parameter based on these timing results. 

Some micro-benchmarks may also need the results 
obtained from running other micro-benchmarks. For 
example, to determine the latency of an instruction in 
cycles rather than in nanoseconds, the control engine 
needs to know the cycle time of the processor. This can 
be specified by the user or it can be measured by another 
micro-benchmark, as discussed in Section 3. 


2.1. Nano-benchmarks 


Even with access to a high-resolution timer, it is hard 
to accurately time operations that take only a few CPU 
cycles to execute. Suppose we want to measure the time 
required to execute a C statement S. If this time is small 
compared to the granularity of the timer, we must mea- 
sure the time required to execute this statement some 
number of times $R_S$ (dependent on S), and divide that
time by $R_S$. If $R_S$ is too small, the time for execution
cannot be measured accurately, whereas if $R_S$ is too
big, the experiment will take longer than it needs to.


    R_S ← 1;
    while (measure_S(R_S) < t_min)
        R_S ← R_S × 2;
    return (measure_S(R_S) ÷ R_S);


Figure 2. Nano-benchmark timing 


Figure 2 shows the timing strategy used in X-Ray
nano-benchmarks. In this code, measure_S(R_S)
measures the time required to execute $R_S$ repetitions of
statement S. To determine a reasonable value for $R_S$,
the code in Figure 2 starts by setting $R_S$ to 1, and then
doubles it until the experiment runs for at least $t_{min}$
seconds. The value of $t_{min}$ can be specified by the user and
defaults to 0.25 seconds in the current implementation.
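
The doubling strategy of Figure 2 can be sketched in C as follows. This is a minimal sketch under stated assumptions, not X-Ray's actual code: the callback type measure_fn, the function time_per_repetition, and the deterministic stand-in fake_measure are hypothetical names introduced for illustration. Unlike Figure 2, the sketch reuses the last measurement instead of timing the calibrated run a second time.

```c
#include <assert.h>

/* Hypothetical callback type: returns the time in seconds needed to
   execute `reps` repetitions of the statement under test. */
typedef double (*measure_fn)(unsigned long reps, void *ctx);

/* Doubles the repetition count until one run lasts at least t_min
   seconds, then returns the average time per repetition. The calibrated
   count is stored in *reps_out when reps_out is not NULL. */
double time_per_repetition(measure_fn measure, void *ctx, double t_min,
                           unsigned long *reps_out)
{
    unsigned long reps = 1;
    double elapsed;

    assert(measure != 0 && t_min > 0.0);
    elapsed = measure(reps, ctx);
    while (elapsed < t_min) {
        reps *= 2;
        elapsed = measure(reps, ctx);
    }
    if (reps_out != 0)
        *reps_out = reps;
    return elapsed / (double)reps;
}

/* Deterministic stand-in for a real timer (illustration only):
   pretends that every repetition costs *ctx seconds. */
double fake_measure(unsigned long reps, void *ctx)
{
    return (double)reps * *(const double *)ctx;
}
```

With a real timer, the callback would run the unrolled loop of the nano-benchmark and return the elapsed wall-clock time.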

A simplistic implementation of measure_S is shown
in Figure 3(a). This code incurs considerable loop
overhead, so we unroll the loop U times (Figure 3(b)).

Another problem is that restructuring compiler op- 
timizations may corrupt the experiment. For exam- 
ple, consider the case when we want to measure the 
latency of a single addition. In our framework, we 
would measure the time taken to execute the C statement
p0 = p0 + p1. It is important to allocate p0 and
p1 in registers, but it is crucial that the compiler not
replace the U statements in the loop body by the statement
p0 = p0 + U × p1, since this would prevent the code
from timing the original statement correctly.

To solve such problems, we need to generate pro- 
grams which the compiler can aggressively optimize 
without disrupting the sequence of operations whose 
execution time we want to measure. We solve this prob- 
lem using a switch statement on a volatile vari- 
able v as shown in Figure 3(c). The semantics of C 
require that v be read from memory; therefore the com- 
piler cannot assume anything about which case of the 
switch is selected. Because there is potential control 
flow to each of the case blocks, it is impossible for the 
compiler to combine or reorder them in any way. 

The final problem is that if the compiler is able to 
deduce that the result of the computations performed 
in S is not used in the rest of the code, it might per- 
form dead-code elimination and remove all instances of 
S altogether. To prevent this unwanted optimization, all 
variables that appear in S are assigned to values read 
from appropriately typed volatile variables in the 
initialize statement; similarly, their final values 
are copied back to the same volatile variables in 
the use statement. 
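
A minimal sketch of the volatile switch technique for the statement p0 = p0 + p1, with an assumed unroll factor U = 4, is given below. The function name measure_add, the variable names, and the use of fall-through between case labels are assumptions made for illustration; the code in Figure 3(c) may differ in detail.

```c
/* Volatile cells: the compiler must perform every read and write. */
volatile int v = 0;       /* selects the entry case of the switch */
volatile int init0 = 0;   /* source and sink for p0 */
volatile int init1 = 1;   /* source and sink for p1 */

/* Executes the statement p0 = p0 + p1 exactly 4 * reps times while v is 0
   (assumed unroll factor U = 4) and returns the final value of p0. */
int measure_add(unsigned long reps)
{
    int p0 = init0;       /* "initialize": values unknown at compile time */
    int p1 = init1;
    unsigned long i;

    for (i = 0; i < reps; i++) {
        switch (v) {      /* volatile read: any case may be the entry point */
        case 0: p0 = p0 + p1;   /* fall through */
        case 1: p0 = p0 + p1;   /* fall through */
        case 2: p0 = p0 + p1;   /* fall through */
        case 3: p0 = p0 + p1;
        }
    }
    init0 = p0;           /* "use": results escape through the volatiles */
    init1 = p1;
    return p0;
}
```

Because every case label is a potential entry point, the compiler cannot merge the four additions into a single multiply-and-add, yet p0 and p1 remain ordinary locals that can be register allocated.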

As we will see in Section 3, there are cases where we 
wish to measure the performance of a sequence of different
statements $S_1, S_2, \ldots, S_n$. To prevent the compiler
from optimizing this sequence, the code generator
will give each $S_i$ a different case label, generating code
of the form shown in Figure 3(d). In this figure, the
number of case labels W is the smallest multiple of n
greater than or equal to U.


Figure 3. Implementation of measure_S


2.2. Nano-benchmark Generator 


The X-Ray nano-benchmark generator accepts as 
an input a nano-benchmark specification and produces 
nano-benchmark C code structured as shown in
Figures 3(c) and 3(d).

The nano-benchmark specification is a tuple which
contains a statement S to be timed and type information
for all variables in S. For example, to measure
the latency of a double-precision floating-point ADD
operation, we use the nano-benchmark specification
(p1 = p1 + p2, (p1, p2 : F64)), which means that we
time the statement p1 = p1 + p2, where p1 and p2
are variables of type double (defined as F64 in X-Ray).
Given this specification, the nano-benchmark generator
can produce code as shown in Figure 3(c). Generating
code of the form shown in Figure 3(d) is more complex
and requires the first element of the tuple to be a function
f : integer → string, which computes the code for
statement $S_i$ from the case label i.


2.3. Implementing a new micro-benchmark 


As we will see in Section 3, implementing a new 
micro-benchmark in X-Ray requires: 


1. Implementing the nano-benchmarks for all tim- 
ing experiments. If their code fits the template 
in Figure 3(d), nano-benchmark specifications are 
enough; 


2. Implementing the micro-benchmark control en- 
gine to describe which nano-benchmarks to run, 
with what parameters, in what order, and how to 
produce a final result from the external parameters 
and the timings. 


3. CPU Micro-benchmarks 
3.1. CPU Frequency 


CPU frequency ($F_{CPU}$) is an important hardware parameter
because other parameters are measured relative
to it (in clock cycles). X-Ray assumes that dependent 
integer additions can be executed at the rate of one per 
cycle, which is valid for most current processors. The 
assumption of dependence is important because mod- 
ern architectures can often issue two or more indepen- 
dent integer addition operations in one cycle, so timing 
independent addition operations would be misleading. 

For this micro-benchmark we use a nano-benchmark
with specification S = (p0 = p0 + p1, (p0, p1 : int)).
Given the time time(S) in nanoseconds required to execute
the statement S, we compute the CPU frequency
in MHz as $F_{CPU} = 1000 / \mathrm{time}(S)$.

As we will see in Section 5, the assumption that dependent
integer additions are executed at the rate of one
per cycle may not be correct for some processors. In
that case, all timing measurements reported by X-Ray 
must be scaled by an appropriate constant to obtain the 
actual values. This is not a serious problem since self- 
optimizing software uses timing measurements mostly 
to choose between different code sequences, so relative 
rather than absolute times are needed. 


3.2. Instruction Latency 


The latency $L_{O,T}$ of an operation (instruction) O,
with operands of type T, is the number of cycles after
one such instruction is dispatched until its result becomes
available to subsequent dependent instructions.

We use a nano-benchmark with specification
$S_{O,T}$ = (p0 = O(p0, p1), (p0, p1 : T)). We then
compute the instruction latency in clock cycles as
$L_{O,T} = \mathrm{time}(S_{O,T}) / (1000 / F_{CPU})$.
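
The unit conversions used for frequency and latency can be written down directly. The sketch below assumes times in nanoseconds and frequencies in MHz, as in the text; the function names cpu_frequency_mhz and latency_cycles are hypothetical.

```c
#include <assert.h>

/* CPU frequency in MHz from the time, in nanoseconds, of one dependent
   integer addition (assumed to take exactly one cycle). */
double cpu_frequency_mhz(double add_time_ns)
{
    assert(add_time_ns > 0.0);
    return 1000.0 / add_time_ns;
}

/* Latency in cycles of an operation whose dependent chain needs
   op_time_ns nanoseconds per operation on a CPU running at f_cpu_mhz.
   The divisor 1000 / f_cpu_mhz is the cycle time in nanoseconds. */
double latency_cycles(double op_time_ns, double f_cpu_mhz)
{
    assert(f_cpu_mhz > 0.0);
    return op_time_ns / (1000.0 / f_cpu_mhz);
}
```

For example, a dependent addition that takes 0.5 ns implies a 2000 MHz clock, and an operation that takes 2 ns per dependent instance on that machine has a latency of 4 cycles.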


3.3. Instruction Throughput 


The throughput $TP_{O,T}$ of an operation (instruction)
O, with operands of type T, is the rate, in cycles per
instruction, at which the CPU can issue independent
instructions of that type. On modern processors the
throughput of an instruction is usually much smaller than
its latency, because of pipelining and super-scalar execution.

To measure $TP_{O,T}$, we could use a nano-benchmark
specification as follows.

    S_{O,T,N,B} =
      ⟨
        {
          p_{(i×B+0)%N} = O(p_{(i×B+0)%N}, p_N);
          p_{(i×B+1)%N} = O(p_{(i×B+1)%N}, p_N);
          ...
          p_{(i×B+B−1)%N} = O(p_{(i×B+B−1)%N}, p_N);
        },
        (p_0, p_1, ..., p_N : T)
      ⟩


Note that this specification generates code of the 
form shown in Figure 3(d). It is further parameterized 
by N and B, which control the number of independent 
instructions to generate. For example, the sequence of
statements generated for N = 3 and B = 1 is the following.


    case 0 : { p0 = O(p0, p3); }
    case 1 : { p1 = O(p1, p3); }
    case 2 : { p2 = O(p2, p3); }
    case 3 : { p0 = O(p0, p3); }
    ...
    case W : { p2 = O(p2, p3); }
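
The mapping from a case label i to the text of its statements can be sketched as a small C function. This is an illustrative sketch, not the X-Ray generator: the name gen_case_body, its signature, and the function-call rendering of the operation O are assumptions, and snprintf requires C99 rather than the ANSI C'89 used by X-Ray itself.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the body of case label i for the throughput specification:
   b statements of the form p_k = op(p_k, p_n) with k = (i*b + j) % n,
   separated by single spaces. Returns the number of characters written
   (excluding the terminator), or -1 if `out` is too small. */
int gen_case_body(char *out, size_t cap, const char *op,
                  unsigned i, unsigned n, unsigned b)
{
    size_t used = 0;
    unsigned j;

    assert(out != 0 && op != 0 && cap > 0 && n > 0 && b > 0);
    out[0] = '\0';
    for (j = 0; j < b; j++) {
        unsigned k = (i * b + j) % n;
        int w = snprintf(out + used, cap - used, "%sp%u = %s(p%u, p%u);",
                         j ? " " : "", k, op, k, n);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;          /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

Calling gen_case_body for i = 0, 1, 2, 3 with n = 3 and b = 1 yields the statements listed above; with b = 2 each case label receives two independent statements.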


In general, we generate B simple statements per case
label because otherwise we cannot measure ILP for
statically scheduled VLIW cores. We then measure the
instruction throughput in clock cycles as follows.


    N ← 2;

The nano-benchmark code for $S_{O,T,N,B}$ exhibits
instruction-level parallelism (ILP) on the order of N × B.
The control engine times the nano-benchmark for
B = 1 and successively growing values of N while 
performance continues to increase due to the additional 
ILP. When the performance levels off for some N, the 
control engine starts growing B to check if increasing 
ILP between case labels improves performance. 


3.4. Instruction Existence 


The existence of certain instructions can influ- 
ence the code produced by some self-optimizing sys- 
tems; for example, ATLAS exploits the existence of 
fused multiply-add (FMA). We determine whether a 
fused multiply-add instruction exists by comparing the 
throughput of a simple multiply with that of a fused 
multiply-add. Similarly, many embedded processors do 
not have dedicated floating-point hardware, but use an 
emulation library instead. In X-Ray, we measure the 
latency of a floating-point ADD, and assert that a hard- 
ware floating-point unit exists if the latency is less than 
10 cycles. 


3.5. Number of Registers 


To measure the number of registers $NR_T$ of type T,
we use a nano-benchmark with specification
S_{T,N} = (p_{i%N} = p_{i%N} + p_{(i+N−1)%N}, (p_0, p_1, ..., p_N : T)).
For example, the sequence of statements generated for
N = 4 is as follows.

    case 0 : p0 = p0 + p3;
    case 1 : p1 = p1 + p0;
    case 2 : p2 = p2 + p1;
    case 3 : p3 = p3 + p2;
    case 4 : p0 = p0 + p3;
    ...
    case W : p3 = p3 + p2;


If all of the $p_i$ are allocated in registers, the time per
operation is much smaller than when some are allocated in
memory. The goal is to determine the maximum N for
which no variables are allocated to memory. The control
engine doubles N until it observes a drop in performance.
After that it performs a binary search in the
interval [N ÷ 2, N). The actual control engine algorithm
is as follows.


    N ← 4;
    while (time(S_{T,N}) ÷ time(S_{T,2}) < 1 + ε)
        N ← N × 2;
    R ← N;
    L ← N ÷ 2;
    while (R − L > 1)
        P ← (L + R) ÷ 2;
        if (time(S_{T,P}) ÷ time(S_{T,2}) < 1 + ε)
            L ← P;
        else
            R ← P;
    NR_T ← L;
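
The doubling phase followed by binary search can be sketched in C with the timing step abstracted behind a callback. The names time_fn, count_registers, and toy_time are hypothetical, and the toy machine model exists only to make the sketch self-contained; a real control engine would time the generated nano-benchmark instead.

```c
#include <assert.h>

/* Hypothetical callback: time per operation of the register
   nano-benchmark when it uses n variables. */
typedef double (*time_fn)(unsigned n, void *ctx);

/* Returns the largest n whose time stays below (1 + eps) times the
   2-variable baseline, found by doubling and then binary search. */
unsigned count_registers(time_fn time_of, void *ctx, double eps)
{
    double limit;
    unsigned n = 4, lo, hi;

    assert(time_of != 0 && eps > 0.0);
    limit = time_of(2, ctx) * (1.0 + eps);
    while (time_of(n, ctx) < limit)
        n *= 2;
    hi = n;         /* first doubled size that showed a slowdown */
    lo = n / 2;     /* last size known not to show one */
    while (hi - lo > 1) {
        unsigned mid = lo + (hi - lo) / 2;
        if (time_of(mid, ctx) < limit)
            lo = mid;   /* mid variables still fit in registers */
        else
            hi = mid;   /* mid variables already spill */
    }
    return lo;
}

/* Toy machine model (illustration only): 1 ns per operation while all
   variables fit in *ctx registers, 3 ns once any variable spills. */
double toy_time(unsigned n, void *ctx)
{
    unsigned regs = *(const unsigned *)ctx;
    return n <= regs ? 1.0 : 3.0;
}
```

The invariant of the second loop is that lo variables fit in registers and hi variables do not, so the loop ends with lo equal to the number of available registers.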


3.6. SMP and SMT 


To measure the number of processors in an SMP architecture,
X-Ray uses the throughput nano-benchmark
of Section 3.3 with specification $S_{ADD,I32,N,B}$, where
N and B are the values for which maximum throughput
is achieved. The number p of threads running concurrent
instances of this configuration that exhibit no slowdown
compared to running a single thread characterizes
the number of physical processors in an SMP. Reading
the number of CPUs with an OS call returns the number
v of virtual SMT processors. The SMT per CPU of
the system is computed as v ÷ p. To find which two virtual
processors share the same physical processor, X-Ray
executes instances of the configuration concurrently on
both. If there is no slowdown, the two virtual processors
do not share a physical processor.


4. Cache Micro-benchmarks 


In this section we summarize our approach for mea- 
suring memory hierarchy parameters. A full description 
along with detailed proofs is given in [11]. The most 
well-known benchmark for measuring memory hierar- 
chy parameters is the Saavedra benchmark [7], but the 
timing results are usually inspected manually to deter- 
mine memory hierarchy parameters. Other approaches 
use hardware counters [1, 2], but these are not very 
portable. Our approach produces the hardware parame- 
ter values directly; moreover, our results are accurate 
even for arbitrary cache associativity, cache exclusion, 
and hardware stride prefetching. 

We focus on measuring associativity, block size, capacity,
and hit latency [4] of caches. The first three
parameters are sometimes referred to as the (A, B, C)
of caches, while the last parameter will be referred to
as $l_{hit}$. The description of the algorithms given below
makes use of a parameter T = C ÷ A that we call the stride
of the cache.


4.1. Sequences and compact sequences 


X-Ray determines memory hierarchy parameters by
measuring the average time l to repeatedly access the
elements of different address sequences. When each access
is a cache hit, l = $l_{hit}$ is relatively small, and we
say that the sequence is compact. When each access is a
cache miss, l = $l_{miss}$ is relatively large, and we say that
the sequence is non-compact. Sequences which are neither
compact nor non-compact we call semi-compact.

To measure the capacity and the associativity of the
L1 data cache, X-Ray uses sequences of N addresses,
where successive addresses are separated by a
power-of-two stride S. Such sequences are completely
characterized by their starting address $m_0$, stride S and
number of elements N. We use the notation ⟨m_0, S, N⟩ to
represent them.

Theorem 1 describes the necessary and sufficient
conditions for compactness and non-compactness of a 
sequence of this type for a given cache. Informally, this 
theorem says that as the stride S gets bigger, the max- 
imum length of a compact sequence with stride S de- 
creases until it bottoms out at A, while the minimum 
length of a non-compact sequence with stride S de- 
creases until it bottoms out at A + 1. 


Theorem 1. Consider a cache with parameters
(A, B, C) and a sequence W = ⟨m_0, S, N⟩.


(a) W is compact iff $N \le N_c = A \lceil T/S \rceil$.
(b) W is non-compact iff $N \ge N_{nc} = (A + 1) \lceil T/S \rceil$.


Proof. Omitted. □


4.2. Measuring L1 Cache Parameters


Cache Latency 

We determine the cache hit latency $l_{hit}$ by measuring
the average time to repeatedly access the elements of
⟨m_0, 1, 1⟩, which is obviously compact.

Capacity and Associativity 

Theorem 1 suggests a method for determining the 
capacity C and the associativity A. First, we find A 
by determining the asymptotic limit of the length of a 
compact sequence as the stride is increased. The small- 
est value of the stride for which this limit is reached is 
T, the stride of the cache; once we know A and T, we 
can find C. 

Pseudo-code for measuring C and A of the L1 data
cache is shown below. We use is_compact(W) to empirically
determine if W is compact by comparing the
average time to repeatedly access the elements of W
with the cache hit latency $l_{hit}$.


    S ← 1;
    N ← 1;
    while (is_compact(⟨m_0, S, N⟩))
        N ← 2 × N;
    repeat
        S ← 2 × S;
        N_old ← N;
        N ← min N' ∈ [1, N_old] : ¬is_compact(⟨m_0, S, N'⟩);
    until (N = N_old);
    A ← N − 1;
    C ← (S ÷ 2) × A;


The algorithm can be described as follows. Start
with the sequence ⟨m_0, S, N⟩ = ⟨m_0, 1, 1⟩, which is
compact, and keep doubling N until the sequence is
not compact. Let $N_{old}$ be the first N for which this
happens. Now start doubling the stride S, and for each
S compute the smallest N for which ⟨m_0, S, N⟩ is not
compact. This value of N can be found by using binary
search in the interval [1, $N_{old}$]. If N ≠ $N_{old}$, let
$N_{old}$ = N and recompute N for the next S. Repeat this
step until N = $N_{old}$. At this point the stride has been
doubled once past the stride of the cache, so T = S ÷ 2;
declare A = N − 1 and C = (S ÷ 2) × A.
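
The search for A and C can be sketched in C with is_compact abstracted behind a callback. The names compact_fn, cache_params, measure_capacity_assoc, and model_is_compact are hypothetical; the model oracle simply evaluates condition (a) of Theorem 1 for a known cache and stands in for the timing-based test, so the sketch illustrates the search rather than the measurement.

```c
#include <assert.h>

/* Hypothetical oracle: nonzero if the sequence <m0, stride, n> is compact. */
typedef int (*compact_fn)(unsigned long stride, unsigned long n, void *ctx);

struct cache_params { unsigned long assoc, capacity; };

/* Smallest n in [1, hi] that is not compact. Requires that n = hi is not
   compact and that compactness is monotone in n. */
static unsigned long first_not_compact(compact_fn is_compact, void *ctx,
                                       unsigned long stride, unsigned long hi)
{
    unsigned long lo = 1;
    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (is_compact(stride, mid, ctx))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

struct cache_params measure_capacity_assoc(compact_fn is_compact, void *ctx)
{
    struct cache_params r;
    unsigned long s = 1, n = 1, n_old;

    assert(is_compact != 0);
    while (is_compact(s, n, ctx))      /* stride 1: double N */
        n *= 2;
    do {                               /* double S until N stops shrinking */
        s *= 2;
        n_old = n;
        n = first_not_compact(is_compact, ctx, s, n_old);
    } while (n != n_old);
    r.assoc = n - 1;
    r.capacity = (s / 2) * r.assoc;    /* T = S / 2 and C = T * A */
    return r;
}

/* Model oracle from Theorem 1(a), for illustration only:
   compact iff n <= A * ceil(T / S), with T = C / A. */
int model_is_compact(unsigned long stride, unsigned long n, void *ctx)
{
    const struct cache_params *c = (const struct cache_params *)ctx;
    unsigned long t = c->capacity / c->assoc;
    unsigned long q = (t + stride - 1) / stride;   /* ceil(T / S) */
    return n <= c->assoc * q;
}
```

For a modelled 32KB, 4-way cache the loop stops at S = 16384 with N = 5, giving A = 4 and C = 8192 × 4 = 32768.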

Block Size 


For a cache with stride T and associativity A, the
sequence ⟨m_0, T, 2A⟩ is non-compact since all 2A addresses
map to the same cache set. This sequence can
also be expressed as ⟨m_0, T, A⟩ ∪ ⟨m_0 + C, T, A⟩. If
we offset the second half of the sequence by a constant
δ as shown in Figure 4, we get a set of addresses
D = ⟨m_0, T, A⟩ ∪ ⟨m_0 + C + δ, T, A⟩.


Figure 4. Sequence for measuring B 


The addresses in each of the two subsequences map
to a single cache set. When 0 ≤ δ < B this is the same
cache set and D is non-compact, and when δ ≥ B the
cache sets are different and D is compact. Pseudo-code
for the algorithm is shown below.


    δ ← 1;
    while (¬is_compact(⟨m_0, T, A⟩ ∪ ⟨m_0 + C + δ, T, A⟩))
        δ ← 2 × δ;
    return δ;
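
The block size search can be tried out against a small software model of a cache. The sketch below is an assumption-laden illustration: it models an LRU set-associative cache (struct cache, cache_init, and cache_access are hypothetical names), takes m_0 = 0, reads T, A, and C from the model rather than measuring them, and decides compactness by checking for misses on a second pass instead of by timing.

```c
#include <assert.h>
#include <string.h>

#define MAX_SETS 256
#define MAX_WAYS 8

/* Minimal LRU set-associative cache model (illustration only). */
struct cache {
    unsigned long assoc, block, sets;
    unsigned long tag[MAX_SETS][MAX_WAYS];   /* most recently used first */
    unsigned long used[MAX_SETS];            /* valid ways per set */
};

void cache_init(struct cache *c, unsigned long assoc, unsigned long block,
                unsigned long capacity)
{
    assert(assoc >= 1 && assoc <= MAX_WAYS && block >= 1);
    memset(c, 0, sizeof *c);
    c->assoc = assoc;
    c->block = block;
    c->sets = capacity / (assoc * block);
    assert(c->sets >= 1 && c->sets <= MAX_SETS);
}

/* Accesses one address; returns 1 on a hit and 0 on a miss. */
int cache_access(struct cache *c, unsigned long addr)
{
    unsigned long line = addr / c->block;
    unsigned long set = line % c->sets, tag = line / c->sets;
    unsigned long *way = c->tag[set];
    unsigned long i, n = c->used[set];
    int hit = 0;

    for (i = 0; i < n; i++)
        if (way[i] == tag) { hit = 1; break; }
    if (!hit) {
        if (n < c->assoc)
            c->used[set] = ++n;
        i = n - 1;              /* slot to recycle: new or least recent */
    }
    for (; i > 0; i--)          /* shift down and move tag to the front */
        way[i] = way[i - 1];
    way[0] = tag;
    return hit;
}

/* Returns 1 if D = <0,T,A> U <C+delta,T,A> hits on every access of a
   second pass over its addresses, i.e. if D is compact. */
int d_is_compact(struct cache *c, unsigned long delta)
{
    unsigned long t = c->block * c->sets;      /* cache stride T = C / A */
    unsigned long cap = t * c->assoc;
    unsigned long pass, k;
    int all_hit = 1;

    for (pass = 0; pass < 2; pass++)
        for (k = 0; k < 2 * c->assoc; k++) {
            unsigned long a = (k < c->assoc)
                ? k * t : cap + delta + (k - c->assoc) * t;
            if (!cache_access(c, a) && pass == 1)
                all_hit = 0;
        }
    return all_hit;
}

unsigned long measure_block_size(struct cache *c)
{
    unsigned long delta = 1;
    while (!d_is_compact(c, delta))
        delta *= 2;
    return delta;
}
```

In the model, offsets smaller than the block size keep all 2A lines in one set, so every access of the second pass misses under LRU; the first offset that moves the second half into a neighbouring set equals the block size.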


4.3. Measuring Parameters for Lower Levels 


We can use the algorithms in Section 4.2 to measure
parameters of a lower level cache l, provided we ensure
that the memory accesses miss in all higher level
caches i < l. We accomplish this by using sequences
of sequences. For lack of space, we omit the details and
refer the interested reader to a companion paper [11].


4.4. Implementation of is_compact


We represent our sequences of memory addresses
with arrays of pointers (void *) instead of arrays of
integers (int) as in the Saavedra benchmark. We initialize
the array in such a way that each element contains
the address of the element which should be accessed
immediately after it. A local variable p is initialized
with the address of the element which should be
accessed first.

For a correct implementation it is important to repeatedly
access all elements of the sequence, but the
order in which we access them is irrelevant. To prevent
hardware constant-stride prefetchers from interfering
with our timings, we initialize the array elements by
chaining the pointers so that we visit the elements in
a pseudo-random order.

We perform the timings using a nano-benchmark
with specification (p = *(void **)p, (p : void *)).
The fact that we can use the same nano-benchmark generator
for measuring both CPU parameter values and
cache parameter values demonstrates the flexibility of
the X-Ray architecture.
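
A minimal sketch of the pointer chain is shown below. The function names build_chain and chase, the linear congruential generator used for shuffling, and the convention that the stride is given in units of pointer-sized slots are assumptions made for illustration; X-Ray's own initialization code may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Links the n slots base[0], base[step], ..., base[(n-1)*step] into one
   cycle that is visited in pseudo-random order. Each slot stores the
   address of the slot to access next. `order` is caller-provided scratch
   space for n indices. Returns the address of the first slot. */
void *build_chain(void **base, size_t n, size_t step, size_t *order,
                  unsigned long seed)
{
    size_t i;

    assert(base != 0 && order != 0 && n > 0 && step > 0);
    for (i = 0; i < n; i++)
        order[i] = i;
    for (i = n - 1; i > 0; i--) {               /* Fisher-Yates shuffle */
        size_t j, tmp;
        seed = seed * 1103515245UL + 12345UL;   /* simple LCG (assumed) */
        j = (size_t)((seed >> 16) % (i + 1));
        tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (i = 0; i < n; i++)                     /* link slot i to slot i+1 */
        base[order[i] * step] = (void *)&base[order[(i + 1) % n] * step];
    return (void *)&base[order[0] * step];
}

/* Follows the chain for `hops` dependent loads; the loop body is the
   statement p = *(void **)p from the specification. */
void *chase(void *p, unsigned long hops)
{
    while (hops-- > 0)
        p = *(void **)p;
    return p;
}
```

Because each load depends on the result of the previous one, the loads cannot overlap, and because the visiting order is a shuffled permutation, successive addresses do not follow a constant stride.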


5. Experimental Results 


In this section, we present experimental results ob- 
tained by using X-Ray to measure the hardware para- 
meters of a number of desktop and embedded platforms. 
Embedded processors are particularly challenging be- 
cause there are many variations even within a single 
processor family (in fact, some companies like Tensilica 
make customizable embedded processors). 

We compare our results to the actual values of the 
hardware parameters, as well as to the values obtained 
by lmbench v3.0-a4 [5]. We were unable to build or
run lmbench on some of the architectures. Tables 1
and 2 show a summary of the experimental results for 
CPU features and the memory hierarchy respectively. 
In these tables we use the following special keywords: 


• !exist: a micro-benchmark for measuring this
hardware parameter does not exist in lmbench;

• !os: lower level caches are physically addressed
on all modern machines, so we found it necessary
to use super-pages to obtain consistent measurements
of lower level cache parameters. Support
for super-pages is very OS-specific, so we targeted
Linux as a proof of concept. We are currently
working on the implementation for Solaris, IRIX
and AIX. Similarly, lmbench relies on various OS
features, which were not available on some of the
platforms;

• ?: we could not obtain official information about
the actual value of this hardware parameter.


X-Ray and lmbench measure some hardware parameters
in different units. To allow direct comparison,
we normalized lmbench results as follows.


• Lmbench measures the processor clock cycle $c_{lmb}$
and various latencies $l_{lmb}$ in nanoseconds. We
compute the processor frequency in MHz as
$f_{lmb} = 1000 / c_{lmb}$, and latency in cycles as
$l_{lmb} / c_{lmb}$.

• Instead of measuring instruction throughput, lmbench
measures available instruction parallelism
$p_{lmb}$. We compute instruction throughput in cycles
as $t_{lmb} = l_{lmb} / p_{lmb}$, where $l_{lmb}$ is the latency in
cycles of the corresponding instruction, computed
as shown above.
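
The normalization amounts to three divisions, sketched below in C. The function names are hypothetical and the inputs stand for lmbench readings in the units stated in the list above.

```c
#include <assert.h>

/* Frequency in MHz from the lmbench clock period in nanoseconds. */
double lmb_frequency_mhz(double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return 1000.0 / cycle_ns;
}

/* Latency in cycles from an lmbench latency and clock period, both in
   nanoseconds. */
double lmb_latency_cycles(double latency_ns, double cycle_ns)
{
    assert(cycle_ns > 0.0);
    return latency_ns / cycle_ns;
}

/* Throughput in cycles from a latency in cycles and the available
   instruction parallelism reported by lmbench. */
double lmb_throughput_cycles(double latency_cycles, double parallelism)
{
    assert(parallelism > 0.0);
    return latency_cycles / parallelism;
}
```

For example, a 2 ns latency on a machine with a 0.5 ns clock period is 4 cycles, and with a reported parallelism of 2 the normalized throughput is 2 cycles.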


We now discuss some of the more interesting results. 
5.1. UltraSPARC IIIi and R12000


X-Ray measured all parameters accurately on both
architectures. Lmbench measured all parameters it supports
accurately on the R12000, but gave less accurate
results on the UltraSPARC IIIi, especially for instruction
latency and throughput.


5.2. Power 3 


X-Ray detected an integer fused multiply-add in- 
struction although there is not one in the ISA. We 
verified that even though our measurement sequence 
(r0=r0+r0*r0) is translated into separate dependent
MULTIPLY and ADD instructions, the hardware can 
achieve the same throughput as if there was no ADD. 
Therefore, the multiply-add sequence can be used to 
generate high-performance code for this architecture. 
X-Ray measured the data cache as 129-way set asso- 
ciative instead of 128-way set associative. This resulted 
in capacity of 64.5KB compared to the documented one 
of 64KB. Lmbench gave accurate results on the para- 
meters it is able to measure. 


5.3. Pentium 4 Xeon 


These processors feature two double-pumped integer
ALUs, which led X-Ray to believe that the frequency is
twice the actual value. This is not a problem as
long as all other timings are measured relative to this
frequency. Indeed, as Table 1 shows, all timing values
measured by X-Ray are twice as large as the actual values
(except for Throughput ADD I32).

The throughput of ADD I32 is quite interesting. Because
of the two integer ALUs and integer ADD latency
of 0.5 cycles, we expect an effective throughput of 0.25 
cycles, which translates to 0.5 cycles relative to our fre- 
quency. Instead X-Ray measured 0.679, which is 50% 
greater (3 integer adds per cycle instead of 4). This 
problem occurs because the instruction cache on Pen- 
tium 4 can only deliver 3 instructions per cycle to the 
instruction dispatch engine, preventing the integer pipes 
from achieving the maximum throughput. Although we 
do not present the results here, X-Ray was able to mea- 
sure accurately the number of vector registers (MMX, 
SSE, and SSE2), as well as the latencies and through- 
puts of the corresponding SIMD instructions. 

Lmbench results are close to those of X-Ray but noticeably
less accurate. However, lmbench found the advertised
frequency instead of double the value as X-Ray
did.


5.4. Itanium 2 


X-Ray produced accurate results for all parameters.
Lmbench results were slightly less accurate, with one
major problem: the throughput of ADD I32. This
processor is able to execute 6 independent ADD operations
per cycle, and lmbench measured throughput of
only 0.469. X-Ray measured the correct throughput of
0.169.

Measuring the number of F32 registers illustrates a 
different point. This processor has 128 floating-point 
registers but two of them are hardwired to 0.0 and 1.0. 
In spite of this, X-Ray concluded that the Itanium has 
128 available registers, because the average access time 
did not increase significantly until three or more vari- 
ables were spilled. Reducing the significance threshold 
used by X-Ray may permit a more accurate measure- 
ment but this increases sensitivity to noise. 


5.5. Athlon MP and Opteron 240 


X-Ray measured all CPU feature parameters accu- 
rately. Lmbench gave less accurate results, especially 
for instruction throughput. 

The memory hierarchy numbers for these machines
are interesting because they expose the fact that the L1
and L2 caches implement cache exclusion. Most platforms
support cache inclusion, which means that information
cached at a particular level of the memory hierarchy
is also cached in all lower levels. AMD machines
on the other hand use exclusion, so data never resides
in both the L1 and L2 caches simultaneously. X-Ray
classified the 512KB, 16-way associative L2 cache of
the Athlon MP as an 18-way set-associative cache with
a capacity of 576KB (exactly $C_1 + C_2$). Similarly on
the Opteron 240, the 1MB L2 was classified as a 17-way
set-associative cache with an effective capacity of 1088KB
(exactly $C_1 + C_2$). If the actual capacity of the L2 cache
is needed, it can be obtained by subtracting the capacity
of the L1 cache, although the combined capacity is
what is actually relevant for self-optimizing code that
wants to perform an optimization like cache tiling.


5.6. Xtensa LX 


Xtensa LX is a configurable, extensible processor 
core designed by Tensilica. The hardware parameters 
of different Xtensa LX cores can be very different. This 
feature of the Xtensa LX processor makes it a challeng- 
ing target for X-Ray. 

Frequency 

The processor frequency measured by X-Ray 
(343.225MHz) was 2% different from the actual value 
(350MHz). This inaccuracy can be explained by the 
loop overhead incurred from the code shown in Figure 3(d).
The 256 case statements have a latency of 1
cycle each for a total of 256 cycles, and the loop-back code
at the end has a total latency of 5 cycles. Therefore the
measurement error is 5 ÷ 261 ≈ 2%. While we can partially
compensate for this [8], we do not feel that this
is necessary because we only use frequency to measure
other parameters relative to it (in clock cycles).

Number of Registers

X-Ray measured 11 integer registers and 17 floating- 
point registers, while there are 16 architecturally avail- 
able of both types. We verified that register spills oc- 
curred when using more than 11 integer variables or 
more than 16 floating-point variables. The cost of 
spilling one floating-point register was not sufficient
for X-Ray to declare that a phase transition had hap- 
pened. Of course, we could lower the threshold at which 
X-Ray declares that a phase transition has happened, 
but this might have a negative impact on other platforms 
where measurement noise is relatively high. In practice, 
it is likely that this performance penalty will not be sta- 
tistically significant even if the number of variables is 
one or two more than the number of available registers. 
We are also looking into more robust phase transition 
detection algorithms. 

Other Configurations 


• We introduced two more integer ADD and one
more integer MULTIPLY functional units. X-Ray
correctly measured the new Throughput ADD
I32 and Throughput MULTIPLY I32 as 0.335 and
0.496 cycles respectively.

• We changed the data cache configuration to 6KB,
3-way set associative with 32 byte blocks. X-Ray
correctly measured the new cache parameters.

• We replaced the data cache with a 128-bit single
precision fixed point SIMD unit. After the appropriate
descriptions were added, X-Ray correctly
measured the latency and throughput of ADD and
MULTIPLY, along with the number of vector registers
(1.000, 1.977, 0.998, 0.996, and 16 respectively).


5.7. MIPS R4400 


X-Ray measured all parameters accurately. Lm- 
bench accurately measured all parameters it supports. 
There are two details worth noting. 


• The latency of MULTIPLY I32 measured by X-Ray
is about 15 cycles, while the actual latency is
12 cycles. The reason behind this mismatch is that
the R4400 has special registers hi and lo, which
hold the result of integer multiply. Therefore the
code sequence we use (r0=r0*r1) is translated
to the assembly sequence (hi,lo) = r0 * r1;
r0 = lo; noop; noop. The two noop instructions
are necessary because access to lo is
asynchronous and the compiler needs to make sure
that the value can be copied before it is destroyed.
Therefore, although the latency of an integer multiply
is 12 cycles, it cannot be sustained by code.

• X-Ray measured significantly fewer registers than
are architecturally available. We examined the
generated assembly files and confirmed that it is
the policy of the native compiler to reserve the rest
of the registers.


6. Future Work 


We are actively developing new micro-benchmarks 
inside the X-Ray framework. Our current focus in- 
cludes measuring other parameters of the memory hier- 
archy such as parameters of instruction caches, and re- 
placement policy and bandwidth of different cache lev- 
els, as well as determining all bundles of instructions 
that can be issued in a single CPU cycle at a sustained 
rate. 


X-Ray can be downloaded at
http://iss.cs.cornell.edu/Software/X-Ray.aspx.